A Primal/Dual Stump Algorithm for Large Numerical Datasets

Author

  • Patrick Haffner
Abstract

We demonstrate a stochastic gradient algorithm that can handle the very large number of stump features generated by considering every possible threshold over numerical, or continuous, features. Our problem is to classify data with continuous features, where small variations in a feature value can result in a different classification decision. Consider, for instance, packet statistics used for network analysis: features such as packet counts and durations need to be finely thresholded to achieve accurate decisions. To be used as input to vectorial classifiers, these features are often normalized between -1 and +1, or standardized to unit variance; much of the separation power of the feature is then lost. A more powerful way to build a finely discriminant classifier is to look at every possible partition of the input space that can separate training examples, and to perform a weighted combination of these partitions. In practice, only coordinate-wise partitions are feasible: given feature f and threshold θ, the data is split into the two sets {f < θ} and {f ≥ θ}.

State-of-the-art algorithms that combine such partitions include classification trees and boosted stumps [4]. Indeed, a recent comparative study suggests that the best algorithm on continuous data is boosted trees [2]. One explanation is that the input space used by boosted trees or stumps is much larger than the space used by SVMs, and that the traditional kernel trick does not help much. Unfortunately, boosting and classification-tree algorithms become very cumbersome to apply to very large datasets, as they require a pass through the entire data for each weak classifier added. SVMs, on the other hand, have recently been shown to be amenable to online implementations [1, 5]. In this work, we use the SVM paradigm to describe regularized linear classifiers, but one could also apply the Maximum Entropy or generalized Perceptron paradigms with very similar results.
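A coordinate-wise stump and its weighted combination can be sketched as follows. This is a minimal illustration of the idea described above, not the paper's implementation; all function and variable names are our own.

```python
import numpy as np

# A single stump splits on one coordinate: h(x) = +1 if x[f] >= theta, else -1.
# This realizes the coordinate-wise partition {f < theta} / {f >= theta}.
def stump(x, f, theta):
    return 1.0 if x[f] >= theta else -1.0

# The classifier is a weighted combination of stumps: sign(sum_j w_j * h_j(x)).
def classify(x, stumps, weights):
    score = sum(w * stump(x, f, theta)
                for (f, theta), w in zip(stumps, weights))
    return 1 if score >= 0 else -1

# Hypothetical example: x = [duration, packet count], two stumps.
x = np.array([0.3, 7.0])
stumps = [(0, 0.5), (1, 5.0)]   # (feature index, threshold) pairs
weights = [0.4, 0.8]
print(classify(x, stumps, weights))  # -> 1 (0.4*(-1) + 0.8*(+1) = 0.4 >= 0)
```

Note that no normalization of the raw feature values is needed: the decision depends only on which side of each threshold the value falls.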
We show that an SVM can handle a stump representation as rich as that of classification trees, and we demonstrate an algorithm that is mostly stochastic. The selection of stumps starts by looking, for each feature, at every threshold that partitions the training data in a different way, that is, every distinct value the training data takes for this feature. This implies that each of the n continuous features can correspond to up to l stump features, where l is the number of training examples, so the dimension of the resulting stump space can reach nl. This work shows how to deal with this explosive number of features by playing with alternative representations of the data and the weight vector.
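The threshold enumeration described above can be sketched as follows; this is an illustrative reading of the construction (names are ours), showing how n features over l examples yield up to nl stump features.

```python
import numpy as np

# For each feature, every distinct training value is a candidate threshold:
# thresholds between the same pair of consecutive values induce the same
# partition of the training set, so only distinct values matter.
def enumerate_stump_thresholds(X):
    # X: (l, n) array of l training examples with n continuous features.
    return [np.unique(X[:, f]) for f in range(X.shape[1])]

# Tiny hypothetical training set: l = 3 examples, n = 2 features.
X = np.array([[0.1, 3.0],
              [0.7, 3.0],
              [0.1, 9.0]])
thresholds = enumerate_stump_thresholds(X)
n_stumps = sum(len(t) for t in thresholds)  # bounded by n * l
print(n_stumps)  # -> 4 (2 distinct values per feature)
```

With l in the millions, materializing all nl stump features is infeasible, which is why the full algorithm must work with implicit representations of the data and the weight vector rather than the explicit stump space.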


Similar resources

An interior-point algorithm for $P_{\ast}(\kappa)$-linear complementarity problems based on a new trigonometric kernel function

In this paper, an interior-point algorithm for the $P_{\ast}(\kappa)$-Linear Complementarity Problem (LCP) based on a new parametric trigonometric kernel function is proposed. By applying a strictly feasible starting-point condition and using some simple analysis tools, we prove that our algorithm has an $O((1+2\kappa)\sqrt{n}\log n\log\frac{n}{\epsilon})$ iteration bound for large-update methods, which coinc...


An Interior Point Algorithm for Solving Convex Quadratic Semidefinite Optimization Problems Using a New Kernel Function

In this paper, we consider convex quadratic semidefinite optimization problems and provide a primal-dual Interior Point Method (IPM) based on a new kernel function with a trigonometric barrier term. Iteration complexity of the algorithm is analyzed using some easy-to-check and mild conditions. Although our proposed kernel function is neither a Self-Regular (SR) fun...


ABS Solution of equations of second kind and application to the primal-dual interior point method for linear programming

We consider an application of the ABS procedure to the linear systems arising from primal-dual interior point methods, where Newton's method is used to compute the path to the solution. When approaching the solution, the linear system, which has the form of normal equations of the second kind, becomes more and more ill-conditioned. We show how the use of the Huang algorithm in the ABS cl...


Primal-dual path-following algorithms for circular programming

Circular programming problems are a new class of convex optimization problems that include second-order cone programming problems as a special case. Alizadeh and Goldfarb [Math. Program. Ser. A 95 (2003) 3-51] introduced primal-dual path-following algorithms for solving second-order cone programming problems. In this paper, we generalize their work by using the machinery of Euclidean Jordan alg...


Accelerated Primal-Dual Proximal Block Coordinate Updating Methods for Constrained Convex Optimization

Block Coordinate Update (BCU) methods enjoy low per-update computational complexity because every time only one or a few block variables would need to be updated among a possibly large number of blocks. They are also easily parallelized and thus have been particularly popular for solving problems involving large-scale datasets and/or variables. In this paper, we propose a primal-...



Journal title:

Volume   Issue

Pages  -

Publication date: 2008